Clearly define a problem or an idea of your choice, where you would need to leverage the Foursquare location data to solve or execute. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.
New York is said to be one of the most exciting cities in the world. Within an area of around 300 square miles live 8.4 million people, making New York one of the most densely populated places in the US. As a dream destination for immigrants from all over the globe, New York has a nearly endless number of restaurants serving every cuisine one can imagine. Even though a gross metropolitan product of about two trillion US dollars suggests huge potential for businesses, the market in New York is already fairly saturated, and high real estate prices leave no room for the weak, at least not in the hot spots of the city where the money lies.
For a group of investors, I analyze the restaurant scene in the city "that never sleeps" to assess the potential for a new German restaurant and to get a first idea of possible locations where it would not face too much direct competition.
Describe the data that you will be using to solve the problem or execute your idea. Remember that you will need to use the Foursquare location data to solve the problem or execute your idea. You can absolutely use other datasets in combination with the Foursquare location data. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using, even if it is only Foursquare location data.
I will join a dataset containing the geographic coordinates of the neighborhoods and boroughs of New York with venue data from Foursquare to analyze the competition in the city:
City data: https://geo.nyu.edu/catalog/nyu_2451_34572 provides the geographic coordinates to locate the boroughs and neighborhoods of New York City.
Data source: Geopy. I use the geopy package to obtain the latitude and longitude of New York City for the maps.
Data source: Foursquare API. The Foursquare API is used to obtain the locations and categories of food venues in New York.
I will use the above-mentioned data sources for maps and exploratory data analysis to give an overview of the competition and to identify potential spots for a new German restaurant in New York.
To get started, I first need a dataset with basic information about New York: the coordinates of the neighborhoods that I want to check for the best place to open a new German restaurant.
# Install required packages if not already available
#!conda install -c conda-forge geopy --yes # uncomment this line if necessary
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if necessary
# Import necessary packages
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import matplotlib.pyplot as plt
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs # the samples_generator module was removed in newer scikit-learn versions
import lxml
import json
import wget
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup
import folium
print('Libraries imported.')
New York has a total of 5 boroughs and 306 neighborhoods. In order to identify those neighborhoods in the Foursquare database later, I need a dataset that contains the 5 boroughs and their neighborhoods together with the geodata (latitude and longitude coordinates). I download a GeoJSON file with this information from https://geo.nyu.edu/catalog/nyu_2451_34572.
The relevant information in the GeoJSON file is a list of the neighborhoods in New York, stored under the features key. To transfer that information into a pandas dataframe for the analysis, I first store it in a variable.
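A single entry in that list looks roughly like the following. This is a minimal sketch, not the verbatim file content; real features carry additional keys, but the three fields extracted below are the ones the analysis relies on.

```python
# Minimal sketch of one entry in newyork_data['features'] (real entries hold more keys)
feature = {
    'type': 'Feature',
    'geometry': {'type': 'Point', 'coordinates': [-73.847201, 40.894705]},
    'properties': {'borough': 'Bronx', 'name': 'Wakefield'},
}

# Borough, neighborhood name, and coordinates come from these keys;
# note that GeoJSON stores coordinates as [longitude, latitude]
borough = feature['properties']['borough']
name = feature['properties']['name']
lon, lat = feature['geometry']['coordinates']
print(borough, name, lat, lon)
```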
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
neighborhoods_data = newyork_data['features']
display(neighborhoods_data[0])
In the next step, I transfer the variable into a pandas dataframe.
# Definition of the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
# Collecting the rows, one per neighborhood
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
    # GeoJSON coordinates are stored as [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates']
    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})
# Creating the dataframe in one step (DataFrame.append is deprecated)
neighborhoods = pd.DataFrame(rows, columns=column_names)
# Check the dataframe
display(neighborhoods.head())
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
len(neighborhoods['Borough'].unique()),
neighborhoods.shape[0]))
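The explicit loop above can also be replaced by pandas' json_normalize, which flattens the nested dictionaries in one call. A sketch, with two toy features standing in for neighborhoods_data:

```python
import pandas as pd

# Toy stand-in for newyork_data['features']; real features carry more keys
features = [
    {'geometry': {'coordinates': [-73.8472, 40.8947]},
     'properties': {'borough': 'Bronx', 'name': 'Wakefield'}},
    {'geometry': {'coordinates': [-73.8293, 40.8743]},
     'properties': {'borough': 'Bronx', 'name': 'Co-op City'}},
]

# json_normalize flattens nested dicts into dotted column names
df = pd.json_normalize(features)
# the coordinates column holds [longitude, latitude] lists; split them out
df['Latitude'] = df['geometry.coordinates'].str[1]
df['Longitude'] = df['geometry.coordinates'].str[0]
df = df.rename(columns={'properties.borough': 'Borough',
                        'properties.name': 'Neighborhood'})
neighborhoods_alt = df[['Borough', 'Neighborhood', 'Latitude', 'Longitude']]
print(neighborhoods_alt)
```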
Now that I have a complete list of the neighborhoods in New York, let's have a look at a map of the city and the neighborhoods in it.
# Get the geographical coordinates of New York.
address = 'New York City, NY'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
ny_latitude = location.latitude
ny_longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(ny_latitude, ny_longitude))
# Create a map of New York
map_newyork = folium.Map(location=[ny_latitude, ny_longitude], zoom_start=10)
# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_newyork)
map_newyork
To understand the competition in New York, I add the location of restaurants to the dataset.
# Define the Foursquare credentials
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (placeholder; never publish real credentials)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20200605' # Foursquare API version
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
# Define a function to get information on food venues for all neighborhoods
# I use the explore query to get a list of the top 100 food venues in a 500 m radius around each neighborhood
# Function for all nearby food venues
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100, section="food"):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&section={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT,
            section)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
# Run the function to collect the information for food venues from Foursquare Places API
NY_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
latitudes=neighborhoods['Latitude'],
longitudes=neighborhoods['Longitude']
)
display(NY_venues.shape)
display(NY_venues.head())
display(NY_venues.groupby('Neighborhood').count())
print('There are {} different types of food venues.'.format(len(NY_venues['Venue Category'].unique())))
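The functions above index straight into the response with ["response"]['groups'][0]['items'] and raise an exception as soon as one request fails or the API quota is exhausted. A small defensive helper (hypothetical, not part of the original notebook) could fall back to an empty list instead:

```python
def extract_items(response_json):
    """Safely extract the 'items' list from a Foursquare explore response.

    Returns an empty list when the expected keys are missing, e.g. for
    error responses or exhausted quotas (hypothetical helper).
    """
    try:
        return response_json['response']['groups'][0]['items']
    except (KeyError, IndexError):
        return []
```

Inside getNearbyVenues, the line that indexes into the JSON could then become `results = extract_items(requests.get(url).json())`, so that a single failing neighborhood does not abort the whole collection run.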
# Function for nearby German restaurant venues
def getNearbyGerVenues(names, latitudes, longitudes, radius=500, LIMIT=100, categoryId="4bf58dd8d48988d10d941735"):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT,
            categoryId)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_ger_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_ger_venues.columns = ['Neighborhood',
                                 'Neighborhood Latitude',
                                 'Neighborhood Longitude',
                                 'Venue',
                                 'Venue Latitude',
                                 'Venue Longitude',
                                 'Venue Category']
    return nearby_ger_venues
# Run the function to collect the information for German restaurants from Foursquare Places API
NY_ger_venues = getNearbyGerVenues(names=neighborhoods['Neighborhood'],
latitudes=neighborhoods['Latitude'],
longitudes=neighborhoods['Longitude']
)
display(NY_ger_venues.shape)
display(NY_ger_venues.head())
display(NY_ger_venues.groupby('Neighborhood').count())
print('There are {} German restaurant venues in the dataset.'.format(NY_ger_venues.shape[0]))
# Create a pandas dataframe that contains aggregated information for each neighborhood
# Calculating the total number of food venues and merging with the dataset using the neighborhood as key
No_of_food = NY_venues.groupby('Neighborhood').count()
No_of_food = No_of_food.drop(columns=['Neighborhood Longitude', 'Neighborhood Latitude', 'Venue Latitude', 'Venue Longitude', 'Venue Category'])
No_of_food.rename(columns={"Venue" : "Food venues total"}, inplace=True)
display(No_of_food)
No_of_german = NY_ger_venues.groupby('Neighborhood').count()
No_of_german = No_of_german.drop(columns=['Neighborhood Longitude', 'Neighborhood Latitude', 'Venue Latitude', 'Venue Longitude', 'Venue Category'])
No_of_german.rename(columns={"Venue" : "German restaurants"}, inplace=True)
display(No_of_german)
# Creating a new dataframe using the neighborhoods dataframe as a basis
df_results = neighborhoods.copy()  # copy so the original dataframe stays untouched
df_results = df_results.set_index('Neighborhood')
df_results.head()
df_results.shape
df_results2 = df_results.join(No_of_food, on='Neighborhood')
df_results3 = df_results2.join(No_of_german, on='Neighborhood')
df_results3['Food venues total'].fillna(0, inplace=True)
df_results3['German restaurants'].fillna(0, inplace=True)
# 0/0 yields NaN for neighborhoods without any food venues, so fill those with 0
df_results3['Perc. German restaurants'] = (df_results3["German restaurants"]/df_results3["Food venues total"]).fillna(0)
display(df_results3.head())
print(df_results3.shape)
Now that we have a dataset that contains all the relevant information, we can search for the neighborhoods with a higher-than-average number of food venues and few or no German restaurants.
I will analyze the data to get an overview of the competition in New York and look for the best spot to open a new German restaurant.
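The search described above amounts to a simple filter on the aggregated dataframe. A sketch, with toy data standing in for df_results3 (column names match the dataframe built above; the neighborhood names and counts here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for df_results3, indexed by neighborhood
df_results3 = pd.DataFrame({
    'Food venues total': [120, 40, 95, 10],
    'German restaurants': [0, 1, 2, 0],
}, index=['Midtown', 'Astoria', 'Chelsea', 'Woodlawn'])

# Neighborhoods with above-average food activity but no German restaurant yet
avg_food = df_results3['Food venues total'].mean()
candidates = df_results3[(df_results3['Food venues total'] > avg_food) &
                         (df_results3['German restaurants'] == 0)]
candidates = candidates.sort_values('Food venues total', ascending=False)
print(candidates)
```

On real data, the same two conditions pick out the busy food neighborhoods where a new German restaurant would face little direct competition.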
# Create a map of New York showing the locations of the food venues found in each neighborhood
map_newyork2 = folium.Map(location=[ny_latitude, ny_longitude], zoom_start=10)
# add markers for restaurants to the map
for lat, lng, venue, category in zip(NY_venues['Venue Latitude'], NY_venues['Venue Longitude'], NY_venues['Venue'], NY_venues['Venue Category']):
    label = '{}, {}'.format(venue, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_newyork2)
map_newyork2